-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ETag cache validation patches #137
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Python2 code now must explicitly execute "python2", as the default for the unqualified "python" executable has changed in recent releases to "python3".
Existing riofs caches only support up to 4 GBytes of cached data, as cache_dir_max_size is a 32 bit unsigned integer count of the maximum cache size, in bytes. Extend that by a factor of a million (1024 * 1024, actually). Existing riofs configurations still work, as is, with no change in allowed cache size. A new configuration option can optionally be used, to override the existing setting, with a count of how many MBytes, instead of how many bytes, are allowed in the max cache directory size. Remove "filesystem.cache_dir_max_size" from the list of config settings that must be set on riofs startup, because now either that setting, and/or the new setting of "filesystem.cache_dir_max_megabyte_size" may be set (with the latter one taking precedence, if set.) The code that determines the configured size will complain if neither option is set.
The dir_tree_entry_update_xattrs() routine is only used and is defined in src/dir_tree.c, so mark it static to that file.
Add (void *) cast to various LOG_err and LOG_debug print arguments, so that printing them with a "%p" format doesn't generate an error when compiling with gcc -pedantic enabled. Various other tweaks to get printf formats to match the value types, on both 32 and 64 bit architectures.
Add --enable-strict-compile configuration option to enable support for strict compiler warnings. With a few patches just before this, riofs now compiles cleanly with CFLAGS -Wextra -pedantic -Wall (except for some older gcc versions that don't know about, so ignore, the GCC diagnostic push and pull pragmas, used to suppress a variadic-macros warning from include/log.h) Since the default has been to build riofs without these strict flags, I left that as the default. One has to use the command: configure --enable-strict-compile to create Makefile's that include -Wextra -pedantic -Wall in CFLAGS.
Since the patches that follow this patch make a fairly substantial change to how riofs recognizes that a cached file has changed on the Amazon AWS S3 server (thus requiring invalidating the local cache), I have therefore bumped the riofs version number.
Some key code in the fileio_read_on_head_cb() routine was serving a dual purpose: 1) Set the file's size from the http header Content-Length 2) Perform one of the consistency checks, this one for size change After adding the ETag consistency checking code in the next patch, I will then be able to remove the other, less reliable, methods of checking, including checking on file size. However the setting of the file's size from the http header is still needed. Therefore in this patch I split up this code, into two separate parts, one of which will remain, and the other of which will be removed in a few patches.
This is the one very important patch in this series. The proper way, in my understanding, to check whether the file we have locally cached is still the same file as on the AWS S3 server is to compare ETag's. Anytime we notice that our cached file no longer has the same exact contents as the similarly named file on the AWS S3 server, we need to invalidate our cache and download the new file contents as need be. AWS S3 guarantees (to the level of confidence of MD5 sums) that if and only if the content of a stored file changes, then its ETag changes. There is no need to compute MD5 sums locally (which aren't the same as ETags on multipart files anyway) to make this check. If the ETag hasn't changed, we're still dealing with the same file contents, even if we have only cached a few parts of it so far. Other ways of checking, such as on file size or version or calculated MD5 sums for single part files, are quite less reliable, subject to both false positives and false negatives, and not always applicable, if for example a file is not versioned on Amazon, or a file has multiple parts, or a file's contents changes but its size remained the same. This patch makes the following changes: 1) Add an etag string to CacheEntry, to track what ETag is currently in the cache for that inode. 2) Add get and update routines, to access that etag string. 3) Add an aws_etag string to FileReadData, to keep track of what was reported in the latest ETag header field from the AWS S3 server for that file, so that once the corresponding CacheEntry is available as well, we can get, update, and compare the two etags, and invalidate the cache if necessary. 4) Add insure_cache_etag_consistent_or_invalidate_cache() routine, called from two places, depending on the sequencing of events, to handle this ETag checking. 5) The insure_cache_etag_consistent_or_invalidate_cache() routine returns TRUE if it was able to make the check and (if necessary), invalidate the cache. It returns FALSE if and only if it encountered an error along the way. 6) The previously unused evkeyvalq argument to the fileio_read_on_get_cb() routine is now used, as the ETag header is needed in that routine. 7) FileReadData *rdata can no longer be free'd using a single g_free() call, as it might contain an allocated aws_etag. So add a fileread_destroy() routine to handle free'ing it. 8) Add a cache_etag_is_set flag to FileReadData, to keep track of whether we've set the etag string in the associated (via the ino number) CacheEntry. On newly cached files, setting up the cache occurs some steps after setting up FileReadData, so we need to keep track of whether we've set the cache's etag or not. 9) A few other minor changes in related comments and spacing. This patch provides a potentially substantial performance improvement for a particular use case. If a new large file was placed on an riofs managed server, and multiple clients learn of it and all go to download it at nearly the same (downloads overlap in time), then dreadful cache thrashing could occur, as the cache kept being invalidated when another client came along to start a download, and the size of whatever had been stored in the cache so far did not match the total size of what was on the AWS S3 server. Repeated partial downloads of the same file could result in S3 bandwidth charges that were one or two orders of magnitude larger than needed for a single download of such a file.
Two of the three existing consistency checking methods used in fileio_read_on_head_cb() required some additional supporting code elsewhere, that had no other use. So remove each of these methods, along with any now unneeded supporting code, one at a time. This patch removes the first consistency checking method, using md5 sums.
This patch removes the second no longer used consistency checking method from fileio_read_on_head_cb(), along with what seems to be a substantial amount of no longer used supporting code.
This is the easiest of the three old consistency checking methods to remove - based on file size checking.
Cool, I'll review your patches in the next few days (will try to do this task asap) and let you know. |
wizzard
approved these changes
Jan 22, 2017
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything look good! Thank you!
Thanks for pulling my patches, and thanks for a fine piece of software.
…-- Paul Jackson pj@usa.net
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
As I have written at more length in #136, this set of patches, which replaces the methods of checking whether the cache is still valid with a single method using ETags, is now ready for consideration to pull into the main branch.
See in particular patch "3ee963e ETag fix #3: add ETag consistency checking" for the most interesting details of this change.
Thanks!